Fourier component
Pre-trained Large Language Models Use Fourier Features to Compute Addition
Zhou, Tianyi, Fu, Deqing, Sharan, Vatsal, Jia, Robin
Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
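As a rough illustration of the mechanism this abstract describes, the sketch below builds a toy set of sinusoidal number features in numpy and shows how a low-frequency feature pair can recover approximate magnitude while a high-frequency (period-2) pair recovers parity, i.e. addition mod 2. The frequencies and the 4-dimensional layout are illustrative assumptions, not the paper's actual probing setup.

```python
import numpy as np

# Toy illustration (not the paper's code): represent integers 0..99 with a few
# sinusoidal "Fourier features" and show how low- and high-frequency components
# split the work of addition, as the abstract describes.
N = 100
n = np.arange(N)

# Hypothetical feature set: one low-frequency pair (tracks magnitude) and one
# high-frequency pair with period 2 (tracks parity).
low_freq = 1.0 / N          # slow oscillation over the number range
high_freq = 1.0 / 2         # period-2 oscillation: even vs. odd
features = np.stack([
    np.cos(2 * np.pi * low_freq * n),
    np.sin(2 * np.pi * low_freq * n),
    np.cos(2 * np.pi * high_freq * n),   # +1 for even, -1 for odd
    np.sin(2 * np.pi * high_freq * n),
], axis=1)                               # shape (N, 4)

# The low-frequency pair recovers approximate magnitude via its phase angle.
angle = np.arctan2(features[:, 1], features[:, 0]) % (2 * np.pi)
approx_magnitude = angle / (2 * np.pi * low_freq)
print(np.allclose(approx_magnitude, n, atol=1e-6))    # True

# The high-frequency cosine recovers parity exactly.
parity = (features[:, 2] < 0).astype(int)              # 1 if odd
print(np.array_equal(parity, n % 2))                   # True
```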
Mechanistic Interpretability of Binary and Ternary Transformers
Recent research (arXiv:2310.11453, arXiv:2402.17764) has proposed binary and ternary transformer networks as a way to significantly reduce memory and improve inference speed in Large Language Models (LLMs) while maintaining accuracy. In this work, we apply techniques from mechanistic interpretability to investigate whether such networks learn algorithms that differ meaningfully from those of full-precision transformer networks. In particular, we reverse engineer the algorithms learned for the toy problem of modular addition, where we find that binary and ternary networks learn algorithms similar to those of full-precision networks. This provides evidence against the possibility of using binary and ternary networks as a more interpretable alternative in the LLM setting.
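For context on what "ternary" means here, the following is a minimal sketch of absmean-style ternary weight quantization in the spirit of the cited 1.58-bit line of work (arXiv:2402.17764). The function name and tensor shapes are illustrative assumptions, and this is not the authors' implementation.

```python
import numpy as np

# Illustrative ternary quantization: scale weights by their mean absolute value,
# then round each entry to the nearest value in {-1, 0, +1}.
def ternarize(w: np.ndarray, eps: float = 1e-8):
    scale = np.abs(w).mean() + eps
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary, scale          # dequantize as w_ternary * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q, s = ternarize(w)
print(np.unique(w_q))                # subset of [-1., 0., 1.]
print(np.abs(w - w_q * s).mean())    # average quantization error
```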
Progress measures for grokking via mechanistic interpretability
Nanda, Neel, Chan, Lawrence, Lieberum, Tom, Smith, Jess, Steinhardt, Jacob
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous \textit{progress measures} that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently discovered phenomenon of ``grokking'' exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
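The "addition as rotation" algorithm this abstract describes can be written down directly. The sketch below reproduces it for modular addition with a handful of hand-picked key frequencies; in the trained networks these frequencies are learned, so the specific values here are assumptions for illustration.

```python
import numpy as np

# Sketch of the "clock" algorithm: embed each residue as angles on the circle,
# combine a and b via cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb), and read
# out the answer c that maximizes cos(w(a + b - c)) summed over key frequencies.
p = 113                                      # modulus used in the grokking setup
ws = 2 * np.pi * np.array([1, 7, 19]) / p    # arbitrary "key frequencies"

def mod_add(a: int, b: int) -> int:
    cs = np.arange(p)
    # Sum over frequencies of cos(w(a + b - c)); this peaks exactly at c = (a + b) mod p.
    logits = sum(np.cos(w * (a + b - cs)) for w in ws)
    return int(np.argmax(logits))

assert mod_add(57, 80) == (57 + 80) % p
```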
Learning in RKHM: a $C^*$-Algebraic Twist for Kernel Machines
Hashimoto, Yuka, Ikeda, Masahiro, Kadri, Hachem
Supervised learning in reproducing kernel Hilbert spaces (RKHSs) has been actively investigated since the early 1990s (Murphy, 2012; Christmann & Steinwart, 2008; Shawe-Taylor & Cristianini, 2004; Schölkopf & Smola, 2002; Boser et al., 1992). The notion of reproducing kernels as dot products in Hilbert spaces was first brought to the field of machine learning by Aizerman et al. (1964), while the theoretical foundation of reproducing kernels and their Hilbert spaces dates back to at least Aronszajn (1950). By virtue of the representer theorem (Schölkopf et al., 2001), we can compute the solution of an infinite-dimensional minimization problem in an RKHS from a given finite set of samples. In addition to standard RKHSs, applying vector-valued RKHSs (vvRKHSs) to supervised learning has also been proposed and used for analyzing vector-valued data (Micchelli & Pontil, 2005; Álvarez et al., 2012; Kadri et al., 2016; Minh et al., 2016; Brouard et al., 2016; Laforgue et al., 2020; Huusari & Kadri, 2021). Generalization bounds for supervised learning problems in RKHSs and vvRKHSs have also been derived (Mohri et al., 2018; Caponnetto & De Vito, 2007; Audiffren & Kadri, 2013; Huusari & Kadri, 2021).
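The representer theorem mentioned above is what makes RKHS learning computable from finite samples: the minimizer is a kernel expansion over the training points. The sketch below shows this in the classical scalar-valued RKHS setting via kernel ridge regression; the paper itself works in the more general C*-algebra-valued RKHM setting, which this toy example does not cover.

```python
import numpy as np

# Kernel ridge regression: by the representer theorem, the RKHS minimizer is
# f(x) = sum_i alpha_i k(x_i, x), so we only solve for n coefficients alpha.
def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # representer coefficients

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha                  # f(x) = sum_i alpha_i k(x, x_i)
print(f_test)
```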
Diffusion Variational Autoencoders
Rey, Luis A. Pérez, Menkovski, Vlado, Portegies, Jacobus W.
A standard Variational Autoencoder, with a Euclidean latent space, is structurally incapable of capturing topological properties of certain datasets. To remove topological obstructions, we introduce Diffusion Variational Autoencoders with arbitrary manifolds as latent spaces. A Diffusion Variational Autoencoder uses transition kernels of Brownian motion on the manifold. In particular, it uses properties of the Brownian motion to implement the reparametrization trick and fast approximations to the KL divergence. We show that the Diffusion Variational Autoencoder is capable of capturing topological properties of synthetic datasets. Additionally, we train the model on MNIST with spheres, tori, projective spaces, SO(3), and a torus embedded in R^3 as latent spaces. Although a natural dataset like MNIST does not have latent variables with a clear-cut topological structure, training on a manifold latent space can still highlight topological and geometrical properties.
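One simple way to picture the reparametrized sampling the abstract refers to is a projected random walk: take small Gaussian steps in the ambient space and renormalize back onto the manifold, which approximates a Brownian-motion transition kernel on the sphere. The sketch below is a hedged illustration of that idea and may differ from the paper's exact construction.

```python
import numpy as np

# Approximate sample from a Brownian-motion transition kernel on the unit
# sphere S^2, started at mu with "time" t, via small projected Gaussian steps.
def brownian_step_on_sphere(mu, t, n_steps=20, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    x = mu / np.linalg.norm(mu)
    step_std = np.sqrt(t / n_steps)
    for _ in range(n_steps):
        x = x + step_std * rng.normal(size=x.shape)   # Euclidean increment
        x = x / np.linalg.norm(x)                     # project back onto S^2
    return x

mu = np.array([0.0, 0.0, 1.0])       # "mean" direction predicted by an encoder
z = brownian_step_on_sphere(mu, t=0.1)
print(z, np.linalg.norm(z))           # a point on the sphere near mu
```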
Stability of the Stochastic Gradient Method for an Approximated Large Scale Kernel Machine
Samareh, Aven, Parizi, Mahshid Salemi
In this paper we measure the stability of the stochastic gradient method (SGM) for learning an approximated Fourier primal support vector machine. The stability of an algorithm is assessed through its generalization error, measured as the absolute difference between the test and the training error. Our problem is to learn an approximated kernel function using random Fourier features for a binary classification problem in an online convex optimization setting. For a convex, Lipschitz continuous, and smooth loss function, the stochastic gradient method is stable given a reasonable number of iterations. We show that, with high probability, SGM generalizes well for an approximated kernel under the given assumptions. We empirically verify the theoretical findings for different parameters using several datasets.
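To make the setup concrete, the sketch below maps inputs through random Fourier features approximating an RBF kernel and runs plain stochastic gradient steps on a convex, Lipschitz, smooth loss. The choice of logistic loss, the step-size schedule, and the toy data are assumptions for illustration, not the paper's exact primal SVM formulation.

```python
import numpy as np

# Random Fourier features z(x) = sqrt(2/D) cos(Wx + b) approximating an RBF
# kernel, followed by stochastic gradient steps on the logistic loss.
rng = np.random.default_rng(0)
n, d, D = 500, 5, 200
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])            # toy binary labels in {-1, +1}

gamma = 0.5                                      # RBF bandwidth parameter
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)         # random Fourier feature map

w = np.zeros(D)
lr = 0.5
for t in range(5 * n):                           # SGM: one random sample per step
    i = rng.integers(n)
    margin = y[i] * (Z[i] @ w)
    grad = -y[i] * Z[i] / (1.0 + np.exp(margin))  # gradient of logistic loss
    w -= lr / np.sqrt(t + 1) * grad

acc = np.mean(np.sign(Z @ w) == y)
print(f"training accuracy: {acc:.2f}")
```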